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Abstract 

Given genetic variations and various phenotypical traits, such as Magnetic Res- 
onance Imaging (MRI) features, we consider two important and related tasks in 
biomedical research: i)to select genetic and phenotypical markers for disease di- 
ryr) agnosis and ii) to identify associations between genetic and phenotypical data. 

rsj These two tasks are tightly coupled because underlying associations between ge- 

£ — netic variations and phenotypical features contain the biological basis for a dis- 

■ * ease. While a variety of sparse models have been applied for disease diagnosis and 

f*. canonical correlation analysis and its extensions have bee widely used in associ- 

»^ ation studies (e.g., eQTL analysis), these two tasks have been treated separately. 

^_h To unify these two tasks, we present a new sparse Bayesian approach for joint as- 

sociation study and disease diagnosis. In this approach, common latent features 
. £\ are extracted from different data sources based on sparse projection matrices and 

K^ used to predict multiple disease severity levels based on Gaussian process ordinal 

» I regression; in return, the disease status is used to guide the discovery of relation- 

C^ ships between the data sources. The sparse projection matrices not only reveal in- 

teractions between data sources but also select groups of biomarkers related to the 
disease. To learn the model from data, we develop an efficient variational expec- 
tation maximization algorithm. Simulation results demonstrate that our approach 
achieves higher accuracy in both predicting ordinal labels and discovering associa- 
tions between data sources than alternative methods. We apply our approach to an 
imaging genetics dataset for the study of Alzheimer's Disease (AD). Our method 
identifies biologically meaningful relationships between genetic variations, MRI 
features, and AD status, and achieves significantly higher accuracy for predicting 
ordinal AD stages than the competing methods. 



1 Introduction 

Recent advances in biomedical research have provided new opportunities to study dis- 
eases - for example, Alzheimer's disease (AD), the most common neurodegenerative 
disorder - from multiple data sources. For example, one data source contains genetic 
variations, such as single nucleotide polymorphisms (SNPs), which can help us under- 
stand the genetic basis of diseases. Another data source can be molecular and clini- 
cal phenotypes, such as Magnetic Resonance Imaging (MRI) data, which can reveal 
important phenotypic changes in patients. Finding associations between different data 
sources can reveal unknown biological relationships and has a wide range of applica- 
tions in computational biology [1], epidemiology [2], computational neural science [3], 
and imaging genetics [4]. In addition to the genotypes and phenotypic traits, we have 
valuable labeled information about disease stages from patient medical records. Thus 
we face a new data analysis setting where the objective is two-fold: i) finding associa- 
tions between different data sources and ii) selecting relevant (groups of) features from 
all the sources to predict ordinal disease stages. 

Many statistical approaches have been developed to discover associations or select 
features (or variables) for prediction in a high dimensional problem. For association 
studies, representative approaches are canonical correlation analysis (CCA) and its ex- 
tensions [5, 6]. These approaches treat different data sources as separate linear projec- 
tions from a common latent representation. These approaches have been widely used 
in expression quantitative trait locus (eQTL) analysis. For example, Parkhomenko et 
al. [7] applied sparse CCA (sCCA) to find relationships between genetic loci and gene 
expression levels in Utah families; Witten and Tibshirani [8] used sCCA to reveal as- 
sociations between gene expression and DNA copy variation; and Chen et al. [9] used 
structured CCA for pathway selection. For disease diagnosis based on high dimen- 
sional biomarkers, popular approaches include lasso [10], elastic net [11], and group 
lasso [12], and Bayesian automatic relevance determination [13, 14]. Here we treat 
genotypes or phenotypes as predictors (i.e., biomarkers) and the disease status as the 
response in a linear regression or classification setting. Non-zero estimated regression 
or classification weights indicate relevant biomarkers for the disease [15, 16]. 

Despite their wide success in many applications, these approaches are limited by 
the following factors: 

• Most association studies neglect the supervision from the disease status. Because 
many diseases, such as AD, are a direct result of genetic variations and often 
highly correlated to clinical traits, the disease status provides useful yet currently 
unutilized information for finding relationships between genetic variations and 
clinical traits. 

• For disease diagnosis, most sparse approaches use classification models and do 
not consider the order of disease severity. For subjects in AD studies, there is 
a natural severity order from being normal to mild cognitive impairment (MCI) 
and then from MCI to AD. Classification models cannot capture the order in 
AD's severity levels. Furthermore, the classification approaches are often based 
on conditional models {e.g., logistic regression) and ignore relationships between 
multiple views. 



• Most previous methods are not designed to handle heterogeneous data types. The 
SNPs values are discrete (and ordinal based on an additive genetic model), while 
the imaging features are continuous. Popular CCA or lasso-type methods simply 
treat both of them as continuous data and overlook the heterogeneous nature of 
the data. 

To address these problems, we propose a new Bayesian approach that unifies mul- 
tiview learning with sparse ordinal regression for joint association study and disease 
diagnosis. In the new approach, genetic variations and phenotypical traits are gener- 
ated from common latent features based on separate sparse projection matrices and 
suitable link functions and the common latent features are used to predict the disease 
status based on Gaussian process ordinal regression (See Section 2). To enforce spar- 
sity in projection matrices, we assign spike and slab priors [17] over them; these priors 
have been shown to be more effective than l\ penalty to learn sparse projection matrices 
[18, 19]. The sparse projection matrices not only reveal critical interactions between 
the different data sources but also identify groups of biomarkers in data relevant to dis- 
ease status. Finding groups of biomarkers can avoid over-sparsification (i.e., selecting 
one instead of multiple correlated features), thus boosting the accuracy for disease di- 
agnosis. It can also help provide a better biological understanding because these groups 
may form biologically units (i.e., pathways). Meanwhile, via its direct connection to 
the latent features, the disease status influences the estimation of the projection matrices 
so that it can guide the discovery of associations between heterogeneous data sources 
relevant to the disease. Hence we name this new method Supervised Heterogeneous 
Multiview Learning (SHML). 

To learn the model from data, we develop a variational Bayesian expectation max- 
imization (VB-EM) approach (See Section 3). It iteratively minimizes the Kullback 
Leibler divergence between a tractable approximation and exact Bayesian posterior 
distributions and provides an estimate to the model marginal likelihood. Maximizing 
this estimate enables us to automatically choose a suitable dimension for the latent 
features in a principled Bayesian framework. 

In Section 4, we test our approach SHML on both synthetic and real datasets. On 
synthetic data, SHML achieves both higher estimation accuracy in recovering true as- 
sociations between different views than CCA and sparse CCA, and higher prediction 
accuracy than multiple advanced alternative methods, such as the combination of CCA 
and elastic net, and Gaussian process ordinal regression [20]. We then apply SHML 
to an AD study. AD accounts for 60-80% of age-related dementia cases - one in eight 
older Americans has AD - and there is no cure for AD till now. It is believed that its 
underlying pathology precedes the onset of cognitive symptoms for many years [21]. 
Although AD studies have attracted a lot of attention from both academia and indus- 
try [22, 23], to our best knowledge, our paper presents the first (supervised) study to 
uncover associations between genotypes and phenotypic traits relevant to AD. Our re- 
sults on Alzheimer's Disease Neuroimaging Initiative (ADNI) data show that SHML 
achieves highest prediction accuracy among all the competing methods. Furthermore, 
SHML finds biologically meaningful predictive relationships between SNPs, MRI fea- 
tures, and AD status. 




Figure 1 : The graphical model of Supervised Heterogeneous Multiview Learning, where X is 
the continuous view, Z is the ordinal view, and y are the labels. 

2 Model 

First, let us describe the data. We assume there are two heterogeneous data sources: one 
contains continuous data - for example, MRI features - and one discrete ordinal data - 
for instance, SNPs. Note that we can easily generalize our model below to handle more 
views and other data types by adopting suitable link functions (e.g., a Possion model for 
count data). Given data from n subjects, p continuous features and q discrete features, 
we denote the continuous data byapxn matrix X = [xi , . . . , x„] , the discrete ordinal 
data by a q x n matrix Z = [zi, . . . , z„], and the labels (i.e., the disease status) by a 
nxl vector y = [j/i, . . . , yn] T ■ For the AD study, we let j/j = 0, 1, and 2 if the i-th 
subject is in the normal, MCI or AD condition, respectively. 

To link two data sources X and Z together, we introduce common latent features 
U = [ui, . . . , u„] and assume X and Z are generated from U by sparse projection. 
The common latent feature assumption is sensible for association studies because both 
SNPs and MRI features are biological measurements of the same subjects. Note that 
Ui is the latent feature for the i-th subject and its dimension k is estimated by evidence 
maximization. In a Bayesian framework, we give a Gaussian prior over U, p(U) — 
Y[i A/"(uj|0, I), and specify the rest of the model (see Figure 1) as follows: 

• Continuous data. Given U, X is generated from 

n 

p(X\XJ,G, v )=l[M( Xl \Gu l , v - 1 I) 

1=1 

where G = [gi,g2, ---gpl 1 ^ is a p x fc projection matrix, I is an identity ma- 
trix, and ?7 -1 I is the precision matrix of the Gaussian distribution. We assign a 
Gamma prior over 77, p(r)\ri, rq) — Gamma(^|ri, r%) where r\ and r-i are the 
hyperparameters and set to be 10~ 3 in our experiments. 

• Ordinal data. For an ordinal observation z G {0, 1, . . . , 

R — 1} its value is decided by which region an auxiliary variable c falls in 

—00 = 60 < 61 < • . . < bi{ ; = 00. 

If c falls in [b r , b r+ i), z is set to be r. For the AD study, the SNPs Z takes 
values in {0, 1, 2} and therefore R — 3. Given a q x k projection matrix H = 
[hi, h.2, ...h g ] T , the auxiliary variables C = {ctj} and the ordinal data Z are 



generated from 



p(Z,C|U,H) = JJ JJp(cij|hi,Uj)p(2»j|cij) 



where 



p(cij\hi,Uj) =J\f(cij\hJuj,l) 

2 

p(zij\cij) = ^ 6{zjj = r)£(6 r < Cij < b r+ i), 

r=0 

where 5(a) = 1 if a is true and 5(a) = otherwise. 

• Labels. The disease statuses y are ordinal variables too. To generate y, we use 
a Gaussian process ordinal regression model [20] based the latent representation 
U, 

p(y|U)=p(y|f)p(f|U), 

where 

p(f|U)=JV(f|0,K), 

2 

p(y|f) = J2 5 (V* = r W br ^ & < &r +i), 

r=0 

where ATjj = fc(u,,Uj) is the cross-covariance between u^ and u^. We can 
choose k from a rich family of kernel functions such as linear, polynomial, and 
Gaussian kernels to model relationships between the labels y and the latent fea- 
tures U. 

Note that the labels y are linked to the data X and Z via the latent features U and 
the projection matrices H and G. Due to the sparsity in H and G, essentially 
only a few groups of variables in X and Z are selected to predict y. Note that 
each of group is linked to a feature in U. 

• Sparse Priors. Because we want to identify a few critical interactions between 
different data sources, we use spike and slab prior distributions [17] to sparsify 
the projection matrices G and H. Specifically, we use a p x k matrix S g to 
represent the selection of elements in G: if Sij = 1, gij is selected and follows 
a Gaussian prior distribution with variance af; if s^ = 0, g, L j is not selected and 
forced to almost zero (i.e., sampled from a Gaussian with a very small variance 
tr|). Specifically, we have the following prior over G: 

p k 

p(Gis 9 ,n 9 ) = nn^iww) 

i=l.;=l 



where 



P(ffeK) = *5Wtotf|o,*?) + (i - ^M^-lo,^ 2 ), 

p(*y'|7r«)=7rf ? (l-7r«) 1 -i J ) 



where 7T*- 7 in II g is the probability of s' l J = 1, and of 3> c 2 ( m our experiment, 
we set a\ — \ and <j\ = lo~ 6 ). To reflect our uncertainty about n g , we assign 
a Beta hyperprior distribution: 

p k 

p(n fl |i 1 ,/ 2 ) = JIJjBeta(7r«|ii,i 2 ), 

1=1.7 = 1 

where Zi and Z 2 are hyperparameters. We set Zi = Z 2 = 1 in our experiments. 
Similarly, H is sampled from 

q k 

P (His fc ,n fc )=nnp(^i*fcM^i< j '). 
»=ij=i 

where 

P(M#) - 4W(%|0,a?) + (1 - ^)AA(^-|0,a 2 2 ), 

where S^ are binary selection variables and irfi in Ylh is the probability of s^ = 
1. Again, we use a Beta hyperprior distribution: 

q k 

p(U h \di, (fa) = ]J II Beta(7rjf Idi.efe), 
»=ij=i 

where dj and d 2 are hyperparameters. We set di = rf 2 = 1 in our experiments. 

Based on all these specifications, the joint distribution of our model, SHML, is 

p(X, Z, y, U, G, s fl) n fl) v , c, H, s h , U h , f , ) 

= p(X|U, G, ryMGI^M^in^Mngl/!, fa)p(r?|ri, r 2 ) 

• P (z, c|u, H)p(H|s h )p(s fc |n fc )p(n fc |d 1 , d 2 ) 

■p(y|f)p(f|U)p(U). (1) 

3 Estimation 

Given the model specified in the previous section, now we present an efficient, prin- 
cipled method to estimate the latent features U, the projection matrices H and G, the 
selection indicators S g and S^, the selection probabilities II 9 and II/j, the variance r\, 
the auxiliary variables C for generating ordinal data Z, and the auxiliary variables f 
for generating the labels y. In a Bayesian framework, this estimation task amounts to 
computing their posterior distributions. 

However, computing the exact posteriors turns out to be infeasible since we can- 
not calculate the normalization constant of the posteriors based on Equation (1). Thus, 
we resort to a variational Bayesian Expectation Maximization (VB-EM) approach [24]. 



More specifically, in the E step, we approximate the posterior distributions of H, G, S g , Sh, n g , 
Tlh, r/, C and f by a factorized distribution 

Q(H)Q(G)Q(S g )Q(S h )Q(n g )Q(n h )Q( v )Q(C)Q(f) 

and then use the approximate distributions to compute expectations in the M step to 
optimize the latent features U. 

To obtain the variational approximation, we minimize the Kullback-Leibler (KL) 
divergence between the approximate and the exact posteriors, KL(Q\\P) where P rep- 
resents the exact joint posterior distributions. To this end, we use coordinate descent; 
we update an approximate distribution, say, Q(H), while fixing the other approximate 
distributions, and iteratively refine all the approximate distributions. The detailed up- 
dates are given in the following paragraphs. 

3.1 Updating variational distributions for continuous data 

For the continuous data X, the approximate distributions of the projection matrix G, 
the noise variance r\, the selection indicators S g and the selection probabilities Tl g are 

p 
Q(G) = Y[N-( & ;Kn i ), (2) 

?=i 

p k 

1=1.7=1 

p k 

Q(n 9 ) = nnBetaKf,f), (4) 

1=1.7=1 

Q(rj) = Gamma(r?|fi,r 2 ). (5) 

The mean and covariance of gi are calculated as follows: 

n t = ((,)UU T + 4diag((4)) + ^diag(l - (s*))) -1 , 
a 1 a 2 

where (•) means expectation over a distribution, Xj and s* are the transpose of the i-th 
rows of X and S g , (s g ) = [j3n, . . . , (3ik] T , and (gf ■ ) is the j-th diagonal element in 

a. 

The parameter fy in Q(s l J) is calculated as 0ij = l/(l + cxp((log(l — 7T*- 7 ')) — 
(log(7r* J ')) + \ log(^i) + \{gjj){-^i — ^)))- The parameters of the Beta distribution 
Q{-k 1 J) is given by l\ J — fy + l\ and 1% = 1 — /?,j + h- The parameters of the 
Gamma distribution Q{rj) are updated as f\ = r\ + ^ and f% = ^2 + 5tr(XX T ) — 
tr((G)UX T ) + ±tr(UU T (G T G)). 



The moments required in the above distributions are calculated as (77) = &■ and 

(io g (^))=m j )-i,(i[ j +ii j ), 

(\0g(l-n^))=^{li j )-^(l[ j +li j ), 

p 



<[, 



(G) = [A 1; ...,A P ] T , (6) 



where ip(x) = 4^ lnT(x). 



3.2 Updating variational distributions for ordinal data 

For the ordinal data Z, we update the approximate distributions of the projection ma- 
trix H, the auxiliary variables C, the sparse selection indicators S/j and the selection 
probabilities Tlh- Specifically, the variational distributions of C and H are 

q k 
1=1.7 = 1 

Q(cij) oc S(b Zi] < Cij < b Zij+ i)J\f(cij\cij,l), (8) 

</ 

where c^ = "fjiij and 

A, = (UU T + -4diag((sl)) + 4 dia g((! - 4»)~\ 
a 1 a 2 

7j =A,(U(c,)), 

where c t is the transpose of the z-th row of C. 
The variational distributions of Sh and Tlh are 

q k 

Q(S h )=l[l[a°](l-a lj ) 1 -<\ (10) 

1=1.7=1 

q k 

Q(n ft )=nn Bcta (<vi i ,^), (id 

»=i j=i 

whereas = l/(l+exp((log(l-7r^))-(log(^)) + ilog(|) + i(/ 1 f j )(i-i))) ) 

d\ 3 = otij + d\, d % 2 = 1 — O-ij + d 2 , (s l h ) = [an, . . . ,a^] T , and (/if) is the j'-th 
diagonal element in Aj. 



The required moments for updating the above distributions can be calculated as 
follows: 



'-inl\ j 



(Ci) = [(Cil),...,(Ci. 

(log(l-^')>=#t)-^ +<§'), 

where $(•) is the cumulative distribution function of a standard Gaussian distribution. 
Note that in Equation (8), Q{c%j) is a truncated Gaussian and the truncation is con- 
trolled by the observed ordinal data Zij. 

3.3 Updating variational distributions for labels 

We update the variational distribution of the auxiliary variables f as follows: 

n 

Q(f) = l[Q(fi), (12) 

Q{fi) « 6(b Vi <f t < byt+iWMfi,^), (13) 

where 

/i=K i ,^K;^ i <f- nj ), (14) 

4 = K M - K^K-^K^, (15) 

where Kj _,$ is the covariance between u, and U-,j, K_,j _,j is the covariance on U^j 

(U-,i = [ui,---Ui_l,Ui + l,---U n ]), (f-,i) = K/l),--- ,{fi-l),{fi+l),--- Afn)] T , 

and each (/j) is 

(/») = /» - o-/, r ,-. 7 f • (16) 

Note that Q{f%) is also a truncated Gaussian and the truncated region is decided by the 
ordinal label yi. In this way, the supervised information from y is incorporated into 
estimation of f and then estimation of the other quantities by the recursive updates. 

3.4 Optimizing the latent representation U 

After the expectations of the other variables are calculated, we optimize U by maxi- 
mizing the following variational lower bound 

F(U) = -^tr(UU T ) + (,7)tr(X T (G)U) 

- itr«H T H)UU T ) - llog|K| - ^({ff^K" 1 ) 

- ^tr((G T G)UU T ) + tr((C) T (H)U) + constant, (17) 



where 



i' 



(H T H)=^A 4 +7 l7 7, (H) = [h l7 ...,h g ] T , (18) 

i=i 
(fT T > = (f)(f) T -diag((f) 2 ) + diag((f 2 )), (19) 

(f?) = (f i f + al 

+ <j2 (V-(/i)MVK/i),^) 

cry. ' V cry. ' 

2 (V+i-(/i)MV+iK/«),4) 

cry. ' \ (Jf. ' 

and the constant means a value independent of U so that it is irrelevant for optimizing 
U. Note that we can optimize the dimension k by maximizing the full variational lower 
bound of our model, which involves other quantities as well, such as (H) and (G). To 
save space, we do not present the long equation for the full lower bound (which can be 
easily derived based on what we have presented). 

The other required moments are given in Equations (6) and (16). We use the L- 
BFGS algorithm to maximize the cost function F over U. The gradient of U is given 
by 

f)F - - 

— = (r/)(G) T X+ (H) T (C) - (1+ (r,)(G T G) 

+ <H T H))U - ^(K- 1 - iK-^flOK- 1 )^. (21) 

Note that ^j depends on the form of the kernel function fc(iij, Uj). 

Algorithm 1 VB-EM for model estimation 

1. Initialize U, the hyperparamters, and the 

moments of all the approximate distributions. 

2. Loop until convergence: 

E-step Update all the approximate dis- 
tributions according to (2-5, 7-11, 12-13). 
M-step Use L-BFGS to optimize U. 

3. Output U and all the approximate posterior 

distributions. 



3.5 Prediction 

Let us denote the training data as £> t rain = {X tra in, Z tra in, 

ytrain} an d the test data as ©test = {X tC st, Z tost }. The prediction task needs the latent 

representation U tcs t for 2?tcst- There are two candidate strategies for obtaining Utest- 
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The first one is separate learning: we first learn the projection matrices from T> tra ,i n , 
i.e., Q(H) and Q(G), and then fix them in the variational EM procedure on T> tcs t to 
learn U tcs t- Note that there are no updates for ordinal label part on D tcs t and the terms 
regarding ordinal labels should also be removed from Equation (17) and (21). The 
second strategy is joint learning, where we carry out variational EM simultaneously on 
Strain and T> test , A drawback of the first strategy is that the (distributions of) loading 
matrices are fixed when learning latent representation U tos t- Therefore, we adopt the 
the second strategy; in other words, the variational EM algorithm uses all the data to 
update the variation distributions, except Q(f) where only labels in the training set are 
used. After both U tost and U tra ; n are obtained from the M-step, we predict the labels 
for test data as follows: 

ftost = K(Utest) UtrainjK (Utrain) Utrain) (ftrain) , (22) 

R-l 

yl cst = J2 r ■ WW < fLt < b r+ i), (23) 

where y\ cst is the prediction for i-th test sample. 



4 Related Work 

The proposed SHML model is related to a broad family of probabilistic latent vari- 
able models, including probabilistic principle component analysis [25], probabilistic 
canonical correlation analysis [26] and their extensions [27, 28, 29, 30] .They all learn 
a latent representation whose projection leads to the observed data. Recent studies on 
probabilistic factor analysis methods put more focus on the sparsity-inducing priors to 
the projection matrix. Among them, Guan et al. [27] used the Laplace prior, the Jef- 
frey's prior, and the inverse-Gaussian prior; Archambeau & Bach [29] employed the 
inverse-Gamma prior; and Virtanen et al. [30] used the Automatic Relevance Determi- 
nation(ARD) prior. Despite their success, these sparsity-inducing priors have their own 
disadvantages - they confound the degree of sparsity with the degree of regularization 
on both relevant and irrelevant variables, while in practical settings there is little reason 
that these two types of complexity control should be so tightly bounded together. Al- 
though the inverse-Gaussian prior and the inverse-Gamma prior provide more flexibil- 
ity of controlling the sparsity, they suffer from being highly sensitive to the controlling 
parameters and thus lead to unstable solutions. In contrast, our model adopts the spike 
and slab prior, which has been recently used in multi-task multiple kernel learning [31], 
sparse coding [18], and latent factor analysis [32]. Note that while our Beta priors over 
the selection indicators lead to simple yet effective variational updates, the hierarchical 
prior in [32] can better handle the selection uncertainty. Regardless what priors are 
assigned to the spike and slab models, they generally avoid the confounding issue by 
separately controlling the projection sparsity and the regularization effect over selected 
elements. 

SHML is also connected with many methods on learning from multiple sources 
or views [33]. Multiview learning methods are often used to learn a better classifier 
for multi-label classification - usually in text mining and image classification domains 
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- based on correlation structures among the training data and the labels [28, 30, 34]. 
However, in medical analysis and diagnosis, we meet two separate tasks - the asso- 
ciation discovery between genetic variations and clinical traits, and the diagnosis on 
patients. Our proposed SHML conducts these two tasks simultaneously: it employs the 
diagnosis labels to guide association discovery, while leveraging the association struc- 
tures to improve the diagnosis. In particular, the diagnosis procedure in SHML leads to 
an ordinal regression model based on latent Gaussian process models. The latent Gaus- 
sian process treatment differentiates ours from multiview CCA models [35]. Moreover, 
most multiview learning methods do not model the heterogeneous data types from dif- 
ferent views, and simply treat them as continuous data. This simplification can degrate 
the predictive performance. Instead, based on a probabilistic framework , SHML uses 
suitable link functions to fit different types of data. 

5 Experiments 

In this section, we demonstrate the effectiveness of SHML on both synthetic and real 
data for AD study. 

5.1 Simulation Study 

We first design a simulation study to examine the performance of SHML in terms of 
(i) estimation accuracy in finding associations between two views and (ii) prediction 
accuracy on ordinal labels. 

Simulation data. To generate the ground truth, we set n — 200 (200 instances), 
p = q = 40, and k = 5. We designed G, the 40 x 5 projection matrix for the continuous 
data X, to be a block diagonal matrix; each column of G had 8 elements being ones 
and the rest of them were zeros, ensuring each row with only one nonzero element. We 
designed H, the 40 x 5 projection matrix for the ordinal data Z, to be a block diagonal 
matrix; each of the first four columns of H had 10 elements being ones and the rest 
of them were zeros, and the fifth column contains only zeros. We randomly generated 
the latent representations U € R fex " with each column u^ ~ J\f(0,T), To generate 
Z, we first sampled the auxiliary variables C with each column c, ~ A/^Hu,, 1), and 
then decided the value of each element Zij in Z by the region c^ falls in - in other 
words, z^ — J2 r =a r 3(b r < Cij < b r+ i) where b = {— inf, —1, 1, inf}. Similarly, to 
generate y, we sampled the auxiliary variables f from A/"(0, U T U + I) and then each 
yi was generated by p{y. l \f l ) = S(y t = 0)S(f t < 0) + S(y t = l)S(f t > 0). 

Comparative methods. We compared SHML with several state-of-the-art meth- 
ods including (1) CCA [6], which finds the projection directions that maximize the 
correlation between two views, (2) sparse CCA [36, 8], where sparse priors are put 
on the CCA directions, and (3) Multiple Regression with lasso (MRLasso) [37] where 
each column of the second view (Z) is regarded as the output of the first view (X). We 
did not include results from the sparse probabilistic projection approach [29] because 
it performed unstably in our experiments. Regarding the software implementation, we 
used the built-in Matlab Matlab routine for CCA and the code by [36] for sparse CCA. 
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We implemented MRLasso based on the Glmnet package (c ran . r-pro ject . org/ 
web /package s/glmnet/ index . html). 

To compare accuracy on predicting labels y, we compared our method with the fol- 
lowing ordinal or multinomial regression methods: (1) lasso for multinomial regression 
[10], (2) elastic net for multinomial regression [11], (3) sparse ordinal regression with 
the splike and slab prior, (4) CCA + lasso, for which we first ran CCA to obtain the 
latent features H and then applied lasso to predict y, (5) CCA + elastic net, for which 
we first ran CCA to obtain the projection matrices and then applied elastic net on the 
projected data, (6) Gaussian Process Ordinal Regression (GPOR) [20], which employs 
Gaussian processes to learn the latent function for ordinal regression, and (7) Lapla- 
cian Support Vector Machine (LapSVM) [38], a semi-supervised SVM classification 
method. We used the Glmnet package for lasso and elastic net, the GPOR package by 
[20], and the LapSVM package by [38]. For all the methods, we used 10-fold cross 
validation to tune free parameters for each run; for example, we used extensive cross- 
validation to choose the kernel form (Gaussian or Polynomials) and its parameters (the 
kernel width or polynomial orders) for SHML, GPOR, and LapSVM. Note that all 
these methods, except SHML, stack X and Z together into one data matrix and ignore 
their heterogeneous nature. 

Because alternative methods cannot learn the dimension automatically from the 
data, for fair comparison, we provided the dimension of the latent representation to 
all the methods we tested in our simulations. For each run in our experiment, we 
partitioned the data into 10 subsets and used 9 of them for training and 1 subset for 
testing. We repeated the procedure 10 times to generate the averaged results. 

Results. To estimate linkage (i.e., interactions) between X and Z, we calculated the 
cross covariance matrix GH T . We then computed the precision and the recall based 
on the ground truth. The the precision-recall curves are shown in Figure 2. Clearly, 
our method successfully recovered almost all the links and significantly outperformed 
all the competing methods. This improvement may come from i) the use of the spike 
and slab priors, which not only remove irrelevant elements in the projection matrices 
but also avoid over-penalize the active association structures (the Laplace prior used 
in sparse CCA does over penalize the relevant ones) and ii) more importantly, the 
supervision from the labels y, which is probably the biggest difference between ours 
and the other methods for the association study. 

The prediction accuracies on unknown y and their standard errors are shown in 
Figure 3. Our proposed SHML model achieves significant improvement over all the 
other methods. In particular, it reduces the prediction error of elastic net (which ranks 
the second best) by 25%, and reduces the error of LapSVM (which ranks the last), by 
48%. Note that although utilizing the information from the unlabeled data, LapSMV 
lacks the capability to utilize underlying interaction structures and sparsify the model 
parameters, which may contribute its poor performance in the experiments. 

In summary, the simulation results confirm the power of SHML in both discovering 
true associations between heterogeneous data sources and predicting unknown labels. 
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Figure 2: The precision-recall curves for association discovery. 
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Figure 3: Prediction accuracies on the simulation data. The results are averaged over 10 runs. 

5.2 Study of Alzheimer's Disease 

We conducted association analysis and diagnosis of AD based on a dataset from Alzheimer's 
Disease Neuroimaging Initiative(ADNI). The ADNI study is a longitudinal multisite 
observational study of elderly individuals with normal cognition, mild cognitive im- 
pairment, or AD. AD is the most common form of dementia with about 30 million 
patients worldwide and payments for care are estimated to be $200 billion in 2012. . 
In this analysis, we used SHML to study the associations of genotypes and brain atro- 
phy measured by MRI and to predict the subject status (normal vs MCI vs AD). Note 
that the labels are ordinal since the three states represent increasing severity levels of 
the dementia. 

The dataset was downloaded from http : //adni . loni . ucla . ecru/, After 
removing missing data, it consists 618 subjects (183 normal, 308 MCI and 134 AD), 
and for each patient, there are 924 SNPs (selected as the top SNPs to separate nor- 
mal subjects from AD in ADNI) and 328 MRI features measuring the brain atrophies 
in different brain regions based on cortical thickness, surface area or volume using 



www. alz . org/downloads/f acts_f igures_2 012 .pdf 
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FreeSurfer software. 

We compared SHML with the alternative methods on accuracy of predicting whether 
a subject is in the normal or MCI or AD condition. We randomly split the dataset into 
556 training and 62 test samples 10 times and ran all the competing methods on each 
partition. As for the simulation study, we used the 10-fold cross validation for each 
run to tune free parameters. In SHML, in order to determine dimension k for the latent 
representation U, we computed the variational lower bounds as an approximation to 
the model marginal likelihood (i.e., evidence), with various k values {10, 20, 40, 60}. 
We chose the value with the largest approximate evidence, which led to k = 20 (see 
Figure 4). Our experiments confirmed that with k = 20, SHML achieved highest 






° -7.22 O 



-7.26 L 



o 



10 20 40 60 

Dimension of latent features 



Figure 4: The variational lower bound for the model marginal likelihood. 

prediction accuracy, demonstrating the benefit of evidence maximization. 

The accuracies for predicting unknown labels y and their standard errors are shown 
in Figure 5. Our method achieved the highest prediction accuracy, higher than that of 
the second best method, GP ordinal Regression, by 10% and than that of the worst 
method, CCA+lasso, by 22%. 
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Figure 5: The prediction accuracy with standard errors on the real data. 
We also examined the strongest associations discovered by SHML based on this 
dataset. First of all, the ranking of MRI features in terms of their prediction power 
of three different disease populations (normal, MCI and AD) demonstrate that most 
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Figure 6: The estimated associations between MRI features and SNPs. In each sub- 
figure, the MRI features are listed on the right and the SNP names are given at the 
bottom. 



of top ranked features are based on the cortical thickness measurement. On the other 
hand, the features based on volume and surface area estimation of the same brain struc- 
tures are less predictive. Particularly, thickness measurements of middle temporal lobe, 
precuneus, and fusiform were found to be most predictive compared with other brain 
regions. These findings are consistent with the memory-related function in these re- 
gions and findings in the literature for their prediction power of AD. We also found 
that measurements of the same structure on the left and right side have similar weights, 
indicating that the algorithm can automatically select correlated features in groups, 
since no asymmetrical relationship has been found for the brain regions involved in 
AD. 

Secondly, the analysis of associating genotype to AD disease prediction also gen- 
erated interesting results. Similar to the MRI features, SNPs that are in the vicinity 
of each other often listed together, indicating the group selection characteristics of the 
algorithm. For example, the top ranks SNPs are associated with a few genes including 
PSMC1P12 (proteasome 26S subunit, ATPase), NCOA2 (The nuclear receptor coacti- 
vator 2), and WDR52(WD repeat domain 52), which have been studied intensively in 
cancer research. 

At last, biclustering of the gene-MRI association, as shown in Figure 6 reveal in- 
teresting pattern in terms of the relationship between genetic variations and brain at- 
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rophy measured by structural MRI. For example, the top ranks SNPs are associated 
with a few genes including BCAR3 (Breast cancer anti-estrogen resistance protein 3) 
and NCOA2, which have been studied more carefully in cancer research. One of the 
genes associated with this set of SNPs is MATP (microtubule-associated protein tau), 
which codes the tau gene that are associated closely with the AD. These findings re- 
veal strong association between MATP gene and atrophy in the memory-related brain 
regions. Moreover, the same set of SNPs are also highly associated with cingulate, but 
in an opposite direction. These results indicate an opposite effect of genotype to the 
cingulate region, which is part of the limbic system and involve in emotion formation 
and processing, compared with other structures such as temporal lobe, which plays a 
more important role in the formation of long-term memory. 

In summary, SHML discovered synergistic predictive relationships between brain 
atrophy, genetic variations and the disease status. 

6 Conclusions 

We have presented, SHML, a new Bayesian multiview learning framework. SHML 
simultaneously finds key associations between data sources (i.e., genetic variations and 
phenotypic traits) and to predict unknown ordinal labels. Experimental results on the 
ADNI data indicate that SHML found biologically meaningful associations between 
SNPs and MRI features and led to significant improvement on predicting the ordinal 
AD stages over the alternative classification and ordinal regression methods. Although 
we have focused on the AD study, we expect that SHML, as a powerful extension 
of CCA, can be applied to a wide range of applications in biomedical research - for 
example, eQTL analysis supervised by additional labeling information. 
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